-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Pipeline Parallel Support #4412
Conversation
@andoorve - Exciting!!! |
@andoorve thanks for the effort! Can you write an RFC to describe the overall design so that people can easily understand it? example rfcs: https://github.com/vllm-project/vllm/issues?q=label%3ARFC+sort%3Aupdated-desc |
@youkaichao Yes for sure, it is one of the TODO items above |
vllm/worker/model_runner.py
Outdated
@@ -746,7 +763,8 @@ def execute_model( | |||
logits = self.model.compute_logits(hidden_states, sampling_metadata) | |||
|
|||
# Only perform sampling in the driver worker. | |||
if not self.is_driver_worker: | |||
if (not (is_pipeline_model_parallel_last_rank() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for tp, the first rank (driver) performs sampling, and for pp, the last rank (the last worker in the last pp's tp group) performs sampling, is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the first worker of the last PP's TP group
Updated the RFC here: #4461 @youkaichao Let me know if anything needs further elaboration |
FYI pretty sure PyTorch has a bug, filed here: pytorch/pytorch#125079 Worked around this last week by making sending and receiving phase for each model atomic by concatenating residuals and hidden states. |
Sounds good @youkaichao, I can update mine once that's merged. Will you also include the change to create the multiple CPU TP groups or should I create a separate PR? |
Yes, that's also in my plan. I will break #4460 down into small pieces to be merged, ETA this week. |
Sounds good - I'll revert the PyNCCL changes on this PR and wait for that to be merged to add in |
Hey @andoorve - This is super exciting! I'm trying to run a simple example with - llm = LLM(model="facebook/opt-125m", load_format="dummy")
+ llm = LLM(model="facebook/opt-2.7b", pipeline_parallel_size=2, load_format="dummy") This is the error I hit: error.txt. It seems like it's complaining the
ERROR 05-01 20:45:18 worker_base.py:147] Traceback (most recent call last):
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/worker_base.py", line 139, in execute_method
ERROR 05-01 20:45:18 worker_base.py:147] return executor(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/worker.py", line 140, in determine_num_available_blocks
ERROR 05-01 20:45:18 worker_base.py:147] self.model_runner.profile_run()
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/model_runner.py", line 844, in profile_run
ERROR 05-01 20:45:18 worker_base.py:147] self.execute_model(seqs, kv_caches)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/model_runner.py", line 763, in execute_model
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = model_executable(**execute_model_kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 300, in forward
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 275, in forward
ERROR 05-01 20:45:18 worker_base.py:147] return self.decoder(input_ids, positions, kv_caches, attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 249, in forward
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147] IndexError: list index out of range
I haven't dug into the code deep enough, and curious what is the best way to test and play around with it. If you can point me to some potential starting point, that would be awesome enough. Thanks! |
Hey @GindaChen there's a couple of things here, We haven't supported OPT yet, also the LLMEngine entry point won't work. We're only supporting AsyncLLMEngine right now |
The way I would recommend is try with the online serving entrypoint with the LLaMa model. That'd be the best way to start playing around with it |
LGTM - I guess one thing we can add is PP PyNCCL group |
That's in my plan. Which operation do you need for pp? allreduce? gather? or anything else? |
We only need point-to-point, blocking send and blocking recv only. It's not critical though unless |
Hi @andoorve, While benchmarking using your PR, I've consistently encountered engine timeouts with smaller models on setups far below total VRAM capacity, which might relate to the issues you've linked (e.g., [Bug]: Engine iteration timed out #4293, #4430, #4135). I'm using commit 9d698fa. Setup and Reproduction:
python -m vllm.entrypoints.openai.api_server --model JackFram/llama-160m \
--swap-space 16 \
--disable-log-requests \
--pipeline-parallel-size 2 python benchmarks/benchmark_serving.py --backend vllm --model JackFram/llama-160m \
--dataset-name sharegpt \
--dataset-path /workspace/sharegpt.json \
--num-prompts 3 Observation: Proposed Solution: I traced the issue to Branch with fix: https://github.com/SolitaryThinker/vllm/tree/pipeline-parallel-fix I noticed a new commit from you regarding TP+PP fix, but it didn’t resolve the issue in my environment. Could it be due to missing the latest pynccl changes with groups #4512? This is my first time handling VLLM and Ray, so any insights or corrections on my understanding or approach would be greatly appreciated. Additional technical details: done, _ = await asyncio.wait(requests_in_progress, return_when=asyncio.FIRST_COMPLETED) call still could have workers running when a new engine_step task for the VE is created. I'm not sure the exact interaction that causes the hanging, but inserting a Thanks |
Thanks for the thorough investigation and the fix! It's indeed true that there are existing issues with hanging on the current vLLM mainline, and I have not rebased on the latest PyNCCL changes yet. I also am unable to reproduce this issue easily with GPT2 when I try with my own testing. For these reasons I haven't investigated as deeply yet. I'll give your setup and fix a try once I check if multi-node is functional. I wonder if this is a similar reason as to why the TP-only cases are hanging in the issues mentioned above since there is no such |
FYI: I recently find clean up logic is prone to hang, and this is "fixed" in #4508 . |
@SolitaryThinker I tried the model/commands above that are giving you issues. I was unable to reproduce on my setup. My SetupStarted a fresh instance with the following: GCP g2-standard-48 (4 x NVIDIA L4) ExperimentsStarted vLLM with
Ran the below 3 times:
Killed vLLM server then repeated the above experiment 2 more times for a total of 3 separate serving instances, 9 benchmark tries, and 27 total requests sent. See expected benchmark results each time:
I wonder if it might only be reproducible on other instances... needs further investigation though. |
A very meaningful feature. Here is the command:
And here is error stack:
|
@zhengxingmao Thanks for reporting this! Does this happen without PP? If not, I think it could be some interaction with the following flags with PP. Can you try without these flags and use a model directly from HF? (LLaMa) |
I did some investigation into what you were saying. I think there are real hangs that appear. I tried LLaMa 3 8B with effectively infinite request rate on 2 L4s and saw hangs - not sure if this is the same situation that you found yourself in. Strangely, if I did a warm up request first, the hang went away. The
Also from here, asyncio methods such as I resolved a hang on my end with: Maybe this helps for you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really awesome work thanks @andoorve. None of my comments need to block merging this. I'd be happy to try out my suggestions post-merge.
Would also be nice to fast-follow with single-node support for multiprocessing backend, I think that should be trivial.
Looking forward to seeing the perf test results!
vllm/model_executor/models/llama.py
Outdated
if get_pp_group().is_last_rank: | ||
hidden_states, _ = self.norm(hidden_states, residual) | ||
return hidden_states | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: redundant else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
vllm/worker/model_runner.py
Outdated
return self.output_buffers["hidden_states"] | ||
if get_pp_group().is_last_rank: | ||
return self.output_buffers["hidden_states"] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: redundant else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
remote() # type: ignore | ||
) | ||
else: | ||
await self.engine.stop_remote_worker_execution_loop_async() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andoorve I wonder whether we could keep this within the engine and have each TP group stop/start their worker execution loops independently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@njhill We could do that, we should talk more about this. I guess we can just pass in a TP parameter for that case. Also, how would this level of granularity help?
self.scheduler = [ | ||
Scheduler(scheduler_config, cache_config, lora_config, | ||
parallel_config.pipeline_parallel_size) | ||
for _ in range(parallel_config.pipeline_parallel_size) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to encapsulate all of these loops in a MultiScheduler
class that overrides all the necessary scheduler methods.. could either be a subclass of we could introduce ABC or protocol for this.
That way we can also use singular scheduler directly in PP=1 case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought the same. I defaulted to List of scheduler because there weren't too many methods that were necessary in this case but it's definitely a good refactor for readability.
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
c92257c
to
f9351fb
Compare
Hi! I am super happy to see VLLM is going to support Pipeline Parallel. I tried this PR fork, and got this error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the hardwork! very excited to see this new feature ❤️
@wukaixingxp Certainly! Please look for follow up PR |
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
@andoorve hi, I have some confusions about the PP implementation in this PR, hope you can help make clear. Why is “coroutine + async” implementation adopted here, other than torch way "multiprocess + rpc"?What considerations led to this choice? Is it because this way aligns better with the current architecture of vLLM? And, do you have |
Hi @MuYu-zhi thanks for your question. Basically, it's due to two reasons, it naturally aligns very well with the existing vLLM async architecture that was already there as well as the fact that it's quite simple to do. We don't have data on multiprocess + RPC but CPU usage with async seems to be high. This may be addressed by other PRs. |
Hi Thanks for the great PR! Looks like adding pipeline parallel for Mixtral is pretty straightforward based on the code change for llama class. Is there someone already working on this expansion? Otherwise I can push my local change as well. |
Thanks @binxuan! Please push your local changes |
Hey @binxuan, you can directly create the PR, please paste the responses in that PR. We are still working on correctness tests |
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai> Signed-off-by: Alvant <alvasian@yandex.ru>
Adds initial pipeline parallelism support to vLLM.
ToDo:
Milestone 1: POC Prototype
worker.py
,llm_engine.py
,async_llm_engine.py
and block managers.ray_gpu_executor.py
,worker.py
andmodel_runner.py
to support multiple driver workersMilestone 2: Mergeable
FIX #4461
Goals for this PR:
Non-goals for this PR (To be covered in future PRs)
cc: @zhuohan123 @WoosukKwon @simon-mo @youkaichao